[Data] speedup checkpoint filter 5x #60002
Conversation
Code Review
This pull request introduces a significant performance optimization for checkpoint filtering by converting checkpointed IDs to a NumPy array once, rather than for every block. The changes are well-implemented and consistent across the modified files. My review includes a couple of suggestions to enhance code clarity and maintainability.
combined_ckpt_block = transform_pyarrow.combine_chunks(pyarrow_checkpointed_ids)
combine_ckpt_chunks = combined_ckpt_block[ID_COL].chunks
assert len(combine_ckpt_chunks) == 1
# Convert checkpoint chunk to numpy for fast search.
# Use internal helper function for consistency and robustness (handles null-typed arrays, etc.)
ckpt_ids = transform_pyarrow.to_numpy(combine_ckpt_chunks[0], zero_copy_only=False)
This logic for converting a pyarrow Table to a numpy array of IDs is duplicated from _combine_chunks in checkpoint_filter.py. To improve maintainability, consider extracting this logic into a non-remote helper function in checkpoint_filter.py and calling it from both _combine_chunks and this test. This would avoid having to update the logic in two places if it ever changes.
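For illustration, here is a rough sketch of what such a shared, non-remote helper could look like; the name `_checkpointed_ids_to_numpy` is hypothetical, and the body simply reuses the logic from the diff above (`transform_pyarrow` and `ID_COL` are assumed to come from the surrounding checkpoint_filter.py module):

```python
import numpy as np
import pyarrow as pa

def _checkpointed_ids_to_numpy(pyarrow_checkpointed_ids: pa.Table) -> np.ndarray:
    # Hypothetical shared helper: merge all chunks so the id column is one chunk.
    combined_ckpt_block = transform_pyarrow.combine_chunks(pyarrow_checkpointed_ids)
    combine_ckpt_chunks = combined_ckpt_block[ID_COL].chunks
    assert len(combine_ckpt_chunks) == 1
    # Convert the single checkpoint chunk to numpy for fast search; the internal
    # helper also handles edge cases such as null-typed arrays.
    return transform_pyarrow.to_numpy(combine_ckpt_chunks[0], zero_copy_only=False)
```

Both `_combine_chunks` and the test could then call this helper, so the conversion logic lives in one place.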
This is nice. Some optimizations can be considered for future PRs:
Just to understand better, this is the total time spent in the filter function, right?
@wingkitlee0 yes, this is the total time spent in the filter function. This PR addresses the time overhead caused by repeated copies from pyarrow to numpy.
Seems kind of messy; I will split this into 3 PRs.
Moved to #60294.
Modification
I'm using Ray Data's checkpoint. My data has 115 million records, with primary key {"id": str}. When I use Checkpoint to filter the input blocks, it takes several hours.
I checked the performance bottleneck and found it occurs in the `filter_with_ckpt_chunk` function in checkpoint_filter.py. I added some logs: `ckpt_chunk` has shape (115022113,) and `block_ids` has shape (14534,). From the perf results we can see that:

- `ckpt_chunks` has only one chunk, because the chunks have already been combined in `_combine_chunks`.
- `ckpt_chunk` is a very large chunk holding 115 million ids; converting it from pyarrow to numpy costs about 6 s.
- `ckpt_ids = transform_pyarrow.to_numpy(ckpt_chunk, zero_copy_only=False)` is executed for every block, causing a large time overhead.

This PR obtains the `ckpt_ids` numpy array in advance, avoiding the repeated conversion. In my tests, this reduces the filtering time from 5 hours to 40 minutes. A simplified sketch of the change is shown below.
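To make the change concrete, here is a minimal, self-contained sketch of the before/after behavior. It uses plain pyarrow/numpy instead of Ray's internal `transform_pyarrow` helper, and the `np.isin`-based membership test is only an illustrative stand-in for the real filter logic in checkpoint_filter.py:

```python
import numpy as np
import pyarrow as pa

# Stand-in for the combined checkpoint chunk: a single Arrow chunk of string ids.
ckpt_chunk = pa.array([f"id-{i}" for i in range(1_000_000)])

def filter_block_before(block_ids: np.ndarray, ckpt_chunk: pa.Array) -> np.ndarray:
    # Before: every block repeats the Arrow -> numpy conversion of the huge
    # checkpoint chunk (~6 s per call for 115M string ids in the reported case).
    ckpt_ids = ckpt_chunk.to_numpy(zero_copy_only=False)
    return block_ids[~np.isin(block_ids, ckpt_ids)]

# After: convert the checkpointed ids to numpy once, up front...
ckpt_ids_np = ckpt_chunk.to_numpy(zero_copy_only=False)

def filter_block_after(block_ids: np.ndarray, ckpt_ids: np.ndarray) -> np.ndarray:
    # ...and reuse the same numpy array for every block's membership test.
    return block_ids[~np.isin(block_ids, ckpt_ids)]

# Usage example: ids that are already checkpointed get dropped from the block.
block_ids = pa.array(
    [f"id-{i}" for i in range(999_990, 1_000_010)]
).to_numpy(zero_copy_only=False)
assert np.array_equal(
    filter_block_before(block_ids, ckpt_chunk),
    filter_block_after(block_ids, ckpt_ids_np),
)
```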
Notes:

In this PR, each read task needs to read the `ckpt_ids` (numpy.ndarray) from the object store rather than the Arrow format. This increases I/O and memory overhead, because Arrow arrays usually take less space: in my experiment, the pyarrow array (115 million rows, string-typed) used 1.7 GB of memory, while the numpy array used 9 GB. However, I think this memory overhead is acceptable given the performance improvement. The snippet below illustrates why the numpy copy is so much larger.
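As a rough, scaled-down illustration of that footprint gap (a sketch of how one could measure it, not the original experiment):

```python
import sys
import pyarrow as pa

n = 1_000_000  # scaled-down stand-in for the 115M-row string id column
ids_arrow = pa.array([f"row-{i:012d}" for i in range(n)])
print("arrow buffer bytes:", ids_arrow.nbytes)

# Converting to numpy yields an object-dtype array of Python str objects.
ids_np = ids_arrow.to_numpy(zero_copy_only=False)
# The numpy copy stores one pointer per row plus a full Python str object per id,
# which is why it ends up several times larger than the packed Arrow buffers.
np_bytes = ids_np.nbytes + sum(sys.getsizeof(s) for s in ids_np)
print("numpy bytes (pointers + string objects):", np_bytes)
```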